-
Notifications
You must be signed in to change notification settings - Fork 415
optimize computation in librrgraph utils #2714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In one hot loop, swap three expensive FDIVs for a single FDIV and three FMULs.
@vaughnbetz I think this is ready to merge. It LGTM. |
Thanks! |
I wonder if the compiler would have factored this out ? No harm in any case to merge this ... |
@vaughnbetz I agree, I also think the compiler should handle this. The only thing that may not cause the compiler to do this optimization is precision. Division may yield a different result than multiplication by (1/x); therefore, if safe math compiler optimizations are turned on, the compiler cannot re-order the math operations. Since @heshpdx is working on the SPEC benchmark, it may be that safe math compiler optimizations are turned on to always give the same output regardless of optimization. That's my guess. As far as I can tell, VTR does not turn on the |
Good point; that makes sense. |
Thank you for merging! If For SPEC CPU, we do not build VPR with -Ofast because VPR uses std::isnan so everything would break. There may be some subset of fast math options that are safe, but we haven't explored those yet. The code in question also produces divides with -O3 which is my default for testing. We offer a small amount of tolerance in output which would accommodate such changes, if they actually did produce a different answer (in this case they didn't). This type of code is surprisingly common in the field, especially inside of for-loops. Take a look at my recent github history and you can see me tearing through applications offering this same perf optimization in droves. 😄 |
Actually, I should correct this statement. A month ago I stopped offering tolerance! All outputs for the SPEC CPU version of VPR now much match exactly regardless of compiler or arch, and indeed they do match through testing across a variety of systems. And this patch did not move the needle. |
That's good news! Is this with Amin's change to read in initial placement and use a power of 2 channel width in placement? |
Quick performance fix. In one hot loop, swap three expensive FDIVs for a single FDIV and three FMULs. Most CPUs execute FP multiplies magnitudes faster than FP divides.
No functional change seen, at least in my benchmarks.